[Blocked by #41] Version 2 #38

Draft · wants to merge 2 commits into master
Conversation

@AlexWayfer commented Aug 31, 2018

Added

  • Built-in cache with file-path and max-age
  • Automatic refresh from the source (external repository)

Removed

  • Rake tasks
  • A very large copy of the external UA-bots list
  • `rack` as a dependency

Changed

  • `voight_kampff/rack_request` isn't required by default
  • Default path for the cache is now in the `./tmp` directory instead of `./config`

Development

  • EditorConfig (https://editorconfig.org/) added
  • RuboCop (https://github.com/rubocop-hq/rubocop) added and offenses fixed
  • Drop support for Ruby <= 2.2 (those versions will not even get security patches)
  • Ignore built gems in git

References

  • money-oxr (a gem with a similar approach: automatic refresh of external data)

```ruby
  end

  def load_remote_crawlers(uri = DEFAULT_URL)
    crawlers = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
```

Do you think this should potentially fail gracefully if, say, the URL returns a non-successful response code or any kind of network error occurs? Potentially just `Kernel.warn "VoightKampff: Unable to fetch crawlers from #{uri}"`.

I can see a scenario where the repo gets pulled and a bunch of sites go down.

Author

Wow, thank you. I added tests and code for these situations.

Also, there is a warn with the error inspection: it can be a SocketError, an Errno::ENETUNREACH, or something else.

And the code will "cache" an empty Array in these cases.

Also, I didn't add code for redirects: Net::HTTP can't deal with them easily. It's more convenient to use an external gem, like httpx, for this. But I hope there will be no redirects.
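
For illustration, a minimal sketch of that graceful failure path (`DEFAULT_URL` is assumed to be defined elsewhere in the gem; the exact rescue list and the empty-list fallback are illustrative, not the PR's literal code):

```ruby
require 'net/http'
require 'uri'

def load_remote_crawlers(uri = DEFAULT_URL)
  Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
rescue SocketError, SystemCallError => e
  # Errno::ENETUNREACH and friends are subclasses of SystemCallError.
  # Warn with the error inspection and fall back to an empty JSON array,
  # so a network problem never raises inside the client app.
  warn "VoightKampff: Unable to fetch crawlers from #{uri} (#{e.inspect})"
  '[]'
end
```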

Collaborator

Yes, if we did this at all, I'd want to be really paranoid about it. It should fail very gracefully, and any sort of issue with this whole process should log an error but never take down the whole site. And I'd prefer to avoid a situation where we need to give false positives, too. If there are problems getting the file, I suppose you might need to treat everything as not a bot until the file does download. But that seems bad too.

At the very least, I think the gem should start off using its local file that's packaged with the gem. Then, only if a new version is downloaded successfully, the JSON is parsed successfully, and some sort of sanity check confirms we have a real list of bots (even if that's just checking the number of bots in the file), then and only then should we start to use that new list.
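
A minimal sketch of that sanity-checked switch, assuming a `load_remote_crawlers` helper like the one above; the method name, the size threshold, and the instance variable are illustrative, not the gem's actual code:

```ruby
require 'json'

def refresh_crawlers!
  raw = load_remote_crawlers
  return false if raw.nil? || raw.strip.empty?

  parsed = JSON.parse(raw)
  # Sanity check: only switch away from the currently used list if the
  # download looks like a real list of bots.
  return false unless parsed.is_a?(Array) && parsed.size > 100

  @crawlers = parsed
  true
rescue JSON::ParserError => e
  warn "VoightKampff: invalid crawlers JSON (#{e.message}); keeping the current list"
  false
end
```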

@AlexWayfer (Author)

A reference to the money-oxr gem has been added to the commit message and the PR description accordingly.

@adamcrown (Collaborator)

Wow, thanks for all the work on this @AlexWayfer! I'll do my best to get to reviewing it soon and see if we can get this merged. In the meantime, would you mind addressing the merge conflicts?

@AlexWayfer (Author) commented Sep 30, 2018

> In the meantime, would you mind addressing the merge conflicts?

Done.


By the way, this version is already used in production and seems to work properly.

@AlexWayfer (Author)

> I'll do my best to get to reviewing it soon and see if we can get this merged.

I don't want to rush, just a friendly reminder.

@AlexWayfer (Author)

Conflicts resolved, RuboCop updated, speed test improved.

@adamcrown (Collaborator)

Sorry again for the delay. I've had a bit of a backlog with my open source projects ever since I started a job that gave me some time to work on them. But I'm trying to address at least one issue a week, and this is currently next in line. So I should be able to get to it soon if nothing urgent comes up.

@adamcrown (Collaborator) left a comment

Thanks again @AlexWayfer for all your work on this. There are some little things I'd like to see changed in a code review, which I can add if you want, but before I get into those details I wanted to cover the big issue I'm seeing.

When I wrote this gem I decided to keep a local copy of the crawler-user-agents.json file for a reason. Here are some of the pros and cons I see with that approach.

Pros:

  • No external dependency, in general
  • No downtime for the client app when GitHub or its internet connection is down
  • No problems if the external file or repo is deleted, moved, broken, etc.
  • No performance issues from an external file having to be downloaded at startup and then on a regular interval
  • No issues with firewalls or running the project offline

Cons:

  • You don't get the latest list of bots unless you schedule a rake task to download it

There certainly could be other cons I'm not thinking of too. Feel free to list them.

I hope that's enough for you to see why I made the choice to not pull the file from GitHub by default. That being said, there is a lot of good work in this PR. So...

  1. I like a lot of the cleanup and improvements mixed in here. If you want to separate them out, I'd be happy to merge them.
  2. I wouldn't mind pulling in the code to automatically download and cache the crawler-user-agents.json file, as long as we still keep the local copy included with the gem and use that by default. Then have a config option for "auto refresh" (see the sketch after this comment).

Let me know how you'd like to proceed with this.
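
A purely hypothetical sketch of what option 2 could look like: the bundled JSON stays the default, and remote refresh is opt-in. The option names and defaults below are assumptions for illustration, not the gem's actual API.

```ruby
# Hypothetical defaults: bundled list by default, opt-in auto refresh.
module VoightKampff
  CONFIG_DEFAULTS = {
    auto_refresh: false,                           # use the JSON packaged with the gem
    cache_path:   'tmp/crawler-user-agents.json',  # where a downloaded copy would live
    max_age:      24 * 60 * 60                     # seconds before trying to refresh again
  }.freeze
end
```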

```diff
@@ -1 +0,0 @@
-2.6.2
```
Collaborator

I'm not sure why this file was removed, but I'd like to keep it around unless there is a good reason to remove it.

Author

The gem supports many versions of Ruby. Why should we lock to one specific version? It's like Gemfile.lock (https://stackoverflow.com/a/4151540/2630849), but at the Ruby level.


@AlexWayfer (Author)

> No external dependency, in general

I don't see dependencies as evil. Our (commercial) projects depend on this gem. Is that bad? I don't think so.

In particular, I can contribute only new bots to the dedicated repo, not code, and they will immediately be available in our projects, without gem updates. It's automated data updating, like currency exchange rates.

> No downtime for the client app when GitHub or its internet connection is down

It's a good point. This job (fetching the data update) can be done in a separate thread and with retries. I can try to write code for this here.
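
For illustration, a minimal sketch of such a background refresh with retries; the helper name and the retry/backoff numbers are hypothetical:

```ruby
# Try the refresh off the request path, with a few retries and simple backoff.
def refresh_crawlers_in_background(attempts: 3)
  Thread.new do
    attempts.times do |i|
      break if refresh_crawlers!   # hypothetical helper: true when a new list was accepted
      sleep(2**i)                  # back off between attempts: 1s, 2s, 4s
    end
  end
end
```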

> No problems if the external file or repo is deleted, moved, broken, etc.

The same as the first point: our project can have problems if this gem is deleted, moved, broken, etc. This is a compromise, and we're accepting these risks. And these potential problems are resolvable.

> No performance issues from an external file having to be downloaded at startup and then on a regular interval

Ehm… OK. I think it's less than 10 seconds even with a bad internet connection and a large file, and with the current interval (24 hours, which can be changed) that's about 0.01% of the time, but OK. Again, with a separate thread it can be better, I think.

> No issues with firewalls or running the project offline

A firewall that blocks downloads from GitHub?… And running a project offline while it uses a gem for handling requests?… I think these cases are very rare, and there would be other problems in them anyway. But OK, you're right.

> There certainly could be other cons I'm not thinking of too. Feel free to list them.

  • Many commits in the gem (a code repository) are only about data from the external repo, so it's a kind of duplication.

> I hope that's enough for you to see why I made the choice to not pull the file from GitHub by default.

Yes, of course.

> Let me know how you'd like to proceed with this.

I can (try to) separate those changes, but I want to describe my vision of the suggested approach:

As referenced in the PR description, it's like money-oxr for me. A gem for currency exchange rates doesn't store a copy of all the currency rates and (automatically) update it at an interval. That would be… pointless. It would (almost) require regular releases on the gem's side and duplicate a large mass of data, while that data is updated every day (or maybe even every hour).

The data for this gem (the crawlers list) isn't updated as frequently, but for me the principle is the same: separate code and data. Just like you separate code from the database, code from uploaded files, etc. You don't store them in the repo with the code and update them regularly.

@AlexWayfer AlexWayfer changed the title Version 2 WIP: [Blocked by #41] Version 2 Aug 15, 2019
@AlexWayfer AlexWayfer marked this pull request as draft December 12, 2020 13:52
@AlexWayfer AlexWayfer changed the title WIP: [Blocked by #41] Version 2 [Blocked by #41] Version 2 Dec 12, 2020